Linear regression in R assumes a linear relationship between one or more predictor variables \(X_i\) and a response variable \(Y\).
\[y_i \sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i = \beta_0 + \beta_1 x_i\]
The first element of this model describes how the response variable \(Y\) can be modeled as a normal random variable with mean \(\mu_i\) and standard deviation \(\sigma\).
The second element of this model adds the effect of the predictor variable \(X\) on \(Y\), based on the causal relationship \(X \rightarrow Y\). In this simple case, the relationship is linear and can be described by the equation of a line with intercept \(\beta_0\) and slope \(\beta_1\).
In the introductory lecture for linear models, we used an example dataset to fit a simple regression model.
The dataset itself was generated according to the specifications of the model above. We are going to recreate the dataset and some of the calculations presented during the class.
set.seed(4) # ensures the result is reproducible. Random number generation can vary between operating systems, so the datasets generated in the classroom may differ
x_i <- seq(1, 5, by = 0.5)
n <- length(x_i)
mu <- 1.2 + 3.5 * x_i
y_i <- rnorm(n = n, mean = mu, sd = 3)
lm_data <- data.frame(x_i, mu, y_i, res = y_i - mu)
Create a plot for the dataset
plot(y_i ~ x_i)
As you can see, plotting the data suggests a linear relationship between these two variables
Let’s calculate the coefficients for the regression line according to the Ordinary Least Squares (OLS) method:
\[\widehat{\beta_1}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]
and
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]
x <- x_i
y <- y_i
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2)
beta1_hat
[1] 3.84557
beta0_hat <- mean(y) - beta1_hat * mean(x)
beta0_hat
[1] 1.459434
Add a regression line with the values of the coefficients you calculated
When we generated the dataset (here), \(\mu\) was specified as:
mu <- 1.2 + 3.5 * x_i
This means \(\beta_0 = 1.2\) and \(\beta_1 = 3.5\). How do these values compare to the coefficients you just calculated?
Add a population line to the plot you previously created by completing the code below:
plot(y ~ x)
abline(a = beta0_hat, b = beta1_hat, col = "red")
abline(a = ..., b = ..., col = "blue")
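One possible completion, regenerating the data so the block is self-contained, uses the population coefficients \(\beta_0 = 1.2\) and \(\beta_1 = 3.5\) from the data-generating step (the object names are the ones used above):

```r
set.seed(4)
x <- seq(1, 5, by = 0.5)
y <- rnorm(n = length(x), mean = 1.2 + 3.5 * x, sd = 3)

# OLS estimates calculated by hand
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

plot(y ~ x)
abline(a = beta0_hat, b = beta1_hat, col = "red") # estimated (OLS) line
abline(a = 1.2, b = 3.5, col = "blue")            # population line used to generate the data
```

The red and blue lines will not coincide exactly: the estimates are computed from a small noisy sample, while the blue line is the true relationship.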
Function lm() fits regression models using Ordinary Least Squares.
The values of the coefficients can be found using function coef().
Compare this result with the calculations you did previously.
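As a sketch of that comparison (regenerating the data so the block is self-contained; the name `fit` is just a choice for this example):

```r
set.seed(4)
x_i <- seq(1, 5, by = 0.5)
y_i <- rnorm(n = length(x_i), mean = 1.2 + 3.5 * x_i, sd = 3)
lm_data <- data.frame(x_i, y_i)

fit <- lm(y_i ~ x_i, data = lm_data) # fit by Ordinary Least Squares
coef(fit)                            # compare with the hand-calculated beta0_hat and beta1_hat
```

The intercept and slope returned by coef() should match the values you obtained from the OLS formulas.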
\(SS_y\), \(RSS\), and \(SS_{Reg}\)
\[RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \]
\[SS_y = {\sum_{i=1}^{n}(y_i-\bar{y})^2}\]
\[SS_y = SS_{Reg} + RSS\]
Challenge: Given that you have already learned that the \(R^2\) of a model is the ratio between the sums of squares of the regression and the total sum of squares, calculate it and check the result comparing with the summary of the model.
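One way to carry out the calculation (a sketch that regenerates the data and refits the model; `fit` and the sums-of-squares names are choices for this example):

```r
set.seed(4)
x_i <- seq(1, 5, by = 0.5)
y_i <- rnorm(n = length(x_i), mean = 1.2 + 3.5 * x_i, sd = 3)
fit <- lm(y_i ~ x_i)

RSS    <- sum(residuals(fit)^2)            # residual sum of squares
SS_y   <- sum((y_i - mean(y_i))^2)         # total sum of squares
SS_reg <- sum((fitted(fit) - mean(y_i))^2) # regression sum of squares

SS_y - (SS_reg + RSS) # the identity SS_y = SS_reg + RSS (zero up to rounding)

R2 <- SS_reg / SS_y
R2
summary(fit)$r.squared # should agree with R2
```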
The assumptions of the linear regression model are:
The linear model correctly describes the functional relationship between \(X\) and \(Y\);
The variable \(X\) is measured without error, therefore, allowing us to associate the error component entirely to the \(Y\) variable;
\(Y\) values are independent with normally distributed errors;
Variances are constant along the regression line.
The fourth assumption allows us to use a single constant \(\sigma\) for the variance. Non-constant variances can be recognized using diagnostic plots. Let’s use them.
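A minimal sketch of such a residuals-versus-fitted plot, regenerating the data and refitting the model so the block runs on its own (`fit` is a name chosen for this example):

```r
set.seed(4)
x_i <- seq(1, 5, by = 0.5)
y_i <- rnorm(n = length(x_i), mean = 1.2 + 3.5 * x_i, sd = 3)
fit <- lm(y_i ~ x_i)

plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) # dashed reference line at zero
```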
If the linear model fits the data well, this residual plot should exhibit a scatter of points that is approximately normally distributed and uncorrelated with the fitted values. Points below the dashed line correspond to observations the model overestimates, while points above the line correspond to observations it underestimates.
In order to check the assumption of homoscedasticity, you should observe whether the average size of the residuals increases as \(X\) increases (i.e., heteroscedasticity).
Another visualization for model diagnostics is the set of plots produced by plotting the object containing the fitted model. There are four plots, and we will change the graphical window so we can see all of them at the same time. As Michael J. Crawley would say:
It takes experience to interpret these plots, but what you do not want to see is lots of structure or pattern in the plot.
The first plot is basically the same as the last plot you created. As we saw there, it is a problem if the variance of the residuals increases as the fitted values get larger, which is how this plot tests our fourth assumption. The second plot is a normal quantile-quantile plot; it should be a straight line if the errors are normally distributed (our third assumption). The third plot is very similar to the first one, but the y-axis is on a different scale: all values are now positive because of the square root transformation. The last plot indicates influential points, i.e., points that have the biggest effect on the parameter estimates.
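The four diagnostic plots can be displayed together by splitting the graphical window with par(mfrow = ...). A self-contained sketch (again refitting the model under the name `fit`):

```r
set.seed(4)
x_i <- seq(1, 5, by = 0.5)
y_i <- rnorm(n = length(x_i), mean = 1.2 + 3.5 * x_i, sd = 3)
fit <- lm(y_i ~ x_i)

op <- par(mfrow = c(2, 2)) # 2 x 2 grid of plots
plot(fit)                  # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(op)                    # restore the previous graphical settings
```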